Validation using DIGEST

Notebook to recreate the results presented in the paper. Keep in mind that the results can differ slightly, as the random background model will not create exactly the same random sets as before.

Setup

In [1]:
import os
import sys
import json
import pandas as pd
from IPython.display import IFrame
# ==== import DIGEST python package called biodigest ==== 
from pathlib import Path
sys.path.append(str(Path.cwd().parent))
from biodigest.single_validation import single_validation, save_results
from biodigest.evaluation.d_utils.plotting_utils import create_plots, create_extended_plots

import warnings
warnings.filterwarnings("ignore")

If this is the first time the package is used after installation, make sure the setup has been run so that all required files are available.

In [2]:
from biodigest import setup
setup.main(setup_type="api")
[00:00:00|202.15MB] Starting Setup ...
[00:00:00|202.15MB] Get id and attribute mappings ...
[00:00:01|218.15MB] Get distance mappings ...
[00:00:01|218.15MB] Get distance mappings for jaccard ...
[00:00:24|260.56MB] Get distance mappings for overlap ...
[00:00:38|243.29MB] Finished Setup ...

Clustering

Compare a target clustering of diseases or genes based on the Dunn index, silhouette score, or Davies-Bouldin index, while the random runs are cluster-size-preserving perturbations of the cluster assignments.
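A cluster-size-preserving perturbation can be sketched as a simple shuffle of the label column, so every cluster keeps its original size (a minimal illustration of the idea, not biodigest's internal code; the function name and toy genes are made up):

```python
import random

def perturb_cluster_assignments(assignments, rng=random):
    """Return a random perturbation of cluster labels that preserves
    the size of every cluster, by shuffling the label column."""
    labels = list(assignments.values())
    rng.shuffle(labels)  # permute labels; each cluster keeps its size
    return dict(zip(assignments.keys(), labels))

# toy example: 5 genes in clusters of sizes 3 and 2
clusters = {"BRCA1": 0, "TP53": 0, "EGFR": 0, "KRAS": 1, "MYC": 1}
perturbed = perturb_cluster_assignments(clusters)

# the multiset of labels (and hence the cluster sizes) is unchanged
assert sorted(perturbed.values()) == sorted(clusters.values())
```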

Gene Cluster

Set input parameters

Required input

In [3]:
# ==== define required input ====
tar_cluster = pd.read_csv("input/target_gene_cluster.txt", header=None, sep="\t", dtype=str, names=["id", "cluster", "desc"])
tar_id_type = "symbol"
mode = "clustering"

Optional input (Default parameters are set)

In [4]:
# ==== define optional input influencing results ====
distance_measure = "jaccard"  # distance measure to use: "jaccard" or "overlap"
background_model = "complete"  # the only background model available for clustering
runs = 1000  # number of random runs for empirical p-value estimation
perc = 100  # percentage of the original input to be perturbed for the background model

# ==== define optional input influencing saving of results ====
out_dir = "results/gene_cluster/"
verbose = True  # print additional information during the run
prefix = "gene_cluster"
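The empirical p-value is estimated from the random runs by counting how many random scores are at least as extreme as the input's score; the smallest reported p-values of 0.000999 (= 1/1001 with 1000 runs) suggest the common (r + 1)/(n + 1) estimator. A minimal sketch of that estimator (an assumption about the exact formula, not biodigest's internal code):

```python
def empirical_p_value(observed, random_scores, higher_is_better=True):
    """Estimate an empirical p-value with the (r + 1) / (n + 1) correction,
    where r counts random scores at least as extreme as the observed one."""
    if higher_is_better:
        r = sum(s >= observed for s in random_scores)
    else:
        r = sum(s <= observed for s in random_scores)
    return (r + 1) / (len(random_scores) + 1)

# with 1000 random runs, the smallest attainable p-value is 1/1001
print(empirical_p_value(0.99, [0.1] * 1000))  # -> 0.000999...
```

The +1 correction keeps the estimate strictly positive, which is why no p-value below 1/(runs + 1) can ever be reported.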
In [5]:
results = single_validation(tar=tar_cluster, tar_id=tar_id_type, mode=mode, background_model=background_model, 
                            runs=runs, replace=perc, verbose=verbose, distance=distance_measure)
[00:00:00|243.29MB] Check for proper setup ...
[00:00:00|243.29MB] Starting validation ...
[00:00:00|243.29MB] Load mappings for input into cache ...
[00:00:01|663.54MB] Load distances for input into cache ...
[00:00:08|2458.82MB] Load input data ...
[00:00:08|2459.92MB] Validation of input ...
Missing values for pathway.kegg :9/20
[00:00:09|2459.92MB] Validation of random runs ...
[00:05:26|2461.54MB] Calculating p-values ...
[00:05:26|2461.54MB] Finished validation

Inspect results

Check all P-values

In [6]:
pd.DataFrame(results["p_values"]['values'])
Out[6]:
DI-based SS-based DBI-based
GO.BP 0.105894 0.159840 0.156843
GO.CC 0.813187 0.265734 0.267732
GO.MF 0.753247 0.306693 0.315684
KEGG 0.672328 0.374625 0.375624

Check validation values of input cluster

In [7]:
pd.DataFrame(results["input_values"]['values'])
Out[7]:
DI-based SS-based DBI-based
GO.BP 0.989928 0.015676 0.982069
GO.CC 0.987647 0.076410 0.925316
GO.MF 0.981591 0.084761 0.919973
KEGG 0.978832 0.047343 0.959126

Save results

Save the full results as JSON and the two tables above as CSV

In [8]:
save_results(results=results, prefix=prefix, out_dir=out_dir)

Create visualization

If the plot flag was set, you can inspect a plot of the p-values for every attribute and a mappability plot showing how many of the input ids were assigned values per attribute.

In [9]:
create_plots(results=results, mode=mode, tar=tar_cluster, tar_id=tar_id_type, out_dir=out_dir, prefix=prefix)
In [10]:
create_extended_plots(results=results, mode=mode, tar=tar_cluster, out_dir=out_dir, prefix=prefix)

Keep in mind that after plotting, the resulting plots are also saved in the output directory.

Disease Cluster

Set input parameters

Required input

In [11]:
# ==== define required input ====
tar_cluster = pd.read_csv("input/target_disease_cluster.txt", header=None, sep="\t", dtype=str, names=["id", "cluster", "desc"])
tar_id_type = "ICD-10"
mode = "clustering"

Optional input (Default parameters are set)

In [12]:
# ==== define optional input influencing results ====
distance_measure = "jaccard"  # distance measure to use: "jaccard" or "overlap"
background_model = "complete"  # the only background model available for clustering
runs = 1000  # number of random runs for empirical p-value estimation
perc = 100  # percentage of the original input to be perturbed for the background model

# ==== define optional input influencing saving of results ====
out_dir = "results/disease_cluster/"
verbose = True  # print additional information during the run
prefix = "disease_cluster"

Run script

In [13]:
results = single_validation(tar=tar_cluster, tar_id=tar_id_type, mode=mode, background_model=background_model, 
                            runs=runs, replace=perc, verbose=verbose, distance=distance_measure)
[00:00:00|2513.73MB] Check for proper setup ...
[00:00:00|2513.73MB] Starting validation ...
[00:00:00|2513.73MB] Load mappings for input into cache ...
[00:00:02|2518.80MB] Load distances for input into cache ...
[00:00:03|2739.49MB] Load input data ...
[00:00:03|2739.49MB] Validation of input ...
Missing values for disgenet.genes_related_to_disease :2/64
Missing values for disgenet.variants_related_to_disease :7/64
Missing values for ctd.pathway_related_to_disease :18/64
[00:00:03|2741.31MB] Validation of random runs ...
[00:08:47|2741.75MB] Calculating p-values ...
[00:08:47|2741.75MB] Finished validation

Inspect results

Check all P-values

In [14]:
pd.DataFrame(results["p_values"]['values'])
Out[14]:
DI-based SS-based DBI-based
related_genes 0.470529 0.010989 0.739261
related_variants 0.967033 0.000999 0.537463
KEGG 0.931069 0.877123 0.037962

Check validation values of input cluster

In [15]:
pd.DataFrame(results["input_values"]['values'])
Out[15]:
DI-based SS-based DBI-based
related_genes -7.332896 3.787885 19.701737
related_variants 0.509827 4.524287 9.950938
KEGG 0.423433 -0.726466 0.714666

Save results

Save the full results as JSON and the two tables above as CSV

In [16]:
save_results(results=results, prefix=prefix, out_dir=out_dir)

Create visualization

If the plot flag was set, you can inspect a plot of the p-values for every attribute and a mappability plot showing how many of the input ids were assigned values per attribute.

In [17]:
create_plots(results=results, mode=mode, tar=tar_cluster, tar_id=tar_id_type, out_dir=out_dir, prefix=prefix)
In [18]:
create_extended_plots(results=results, mode=mode, tar=tar_cluster, out_dir=out_dir, prefix=prefix)

Keep in mind that after plotting, the resulting plots are also saved in the output directory.

Set itself

Validate a set either against a reference (a reference set or a reference id) or reference-free. Two background models can be used.
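Both available distance measures compare annotation-term sets. As a reminder of what they compute, the Jaccard index and overlap coefficient can be sketched as follows (illustrative helpers with made-up GO terms, not the biodigest implementation):

```python
def jaccard_index(a, b):
    """|A n B| / |A u B| -- similarity of two annotation-term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def overlap_coefficient(a, b):
    """|A n B| / min(|A|, |B|) -- less sensitive to set-size differences."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

go_terms_x = {"GO:0006915", "GO:0008283", "GO:0007049"}
go_terms_y = {"GO:0006915", "GO:0007049"}
print(jaccard_index(go_terms_x, go_terms_y))        # 2/3
print(overlap_coefficient(go_terms_x, go_terms_y))  # 2/2 = 1.0
```

The overlap coefficient rewards a small set fully contained in a larger one, while the Jaccard index penalizes the size difference.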

Disease Set

Set input parameters

Required input

In [19]:
# ==== define required input ====
tar_set = pd.read_csv("input/target_disease_set.txt", header=None, sep="\t", dtype=str)[0]
tar_id_type = "mesh"
mode = "set"

Optional input (Default parameters are set)

In [20]:
# ==== define optional input influencing results ====
distance_measure = "jaccard"  # distance measure to use: "jaccard" or "overlap"
background_model = "term-pres"  # background model used to generate the random sets
runs = 1000  # number of random runs for empirical p-value estimation
perc = 100  # percentage of the original input to be perturbed for the background model

# ==== define optional input influencing saving of results ====
out_dir = "results/disease_set/"
verbose = True  # print additional information during the run
prefix = "disease_set"
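The term-preserving ("term-pres") background model replaces each input id with a random id that carries a comparable number of annotation terms, so the random sets stay comparably mappable. A rough sketch of that idea (the exact binning and sampling strategy are assumptions, not biodigest's code; the ids are made up):

```python
import random
from collections import defaultdict

def term_preserving_sample(input_ids, term_counts, rng=random):
    """Replace every input id with a random id whose annotation-term
    count matches, keeping the overall mappability similar."""
    # group all known ids into bins by their number of annotation terms
    bins = defaultdict(list)
    for _id, n_terms in term_counts.items():
        bins[n_terms].append(_id)
    return [rng.choice(bins[term_counts[i]]) for i in input_ids]

term_counts = {"D001": 3, "D002": 3, "D003": 7, "D004": 7, "D005": 7}
sample = term_preserving_sample(["D001", "D003"], term_counts)

# each sampled id has the same term count as the id it replaces
assert [term_counts[s] for s in sample] == [3, 7]
```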

Run script

In [21]:
results = single_validation(tar=tar_set, tar_id=tar_id_type, mode=mode, background_model=background_model, 
                            runs=runs, replace=perc, verbose=verbose, distance=distance_measure)
[00:00:00|2744.84MB] Check for proper setup ...
[00:00:00|2744.84MB] Starting validation ...
[00:00:00|2744.84MB] Load mappings for input into cache ...
[00:00:02|2800.61MB] Load distances for input into cache ...
[00:00:03|2800.90MB] Validation of input ...
[00:00:03|2800.90MB] Validation of random runs ...
[00:01:55|2803.45MB] Calculating p-values ...
[00:01:55|2803.45MB] Finished validation

Inspect results

Check all P-values

In [22]:
pd.DataFrame(results["p_values"]['values'])
Out[22]:
JI-based
KEGG 0.000999
related_genes 0.000999
related_variants 0.003996

Check validation values of the input set

In [23]:
pd.DataFrame(results["input_values"]['values'])
Out[23]:
JI-based
KEGG 0.443860
related_genes 0.203047
related_variants 0.031845

Save results

Save the full results as JSON and the two tables above as CSV

In [24]:
save_results(results=results, prefix=prefix, out_dir=out_dir)

Create visualization

If the plot flag was set, you can inspect a plot of the p-values for every attribute and a mappability plot showing how many of the input ids were assigned values per attribute.

In [25]:
create_plots(results=results, mode=mode, tar=tar_set, tar_id=tar_id_type, out_dir=out_dir, prefix=prefix)
In [26]:
create_extended_plots(results=results, mode=mode, tar=tar_set, out_dir=out_dir, prefix=prefix)

Gene Set

Set input parameters

Required input

In [27]:
# ==== define required input ====
tar_set = pd.read_csv("input/target_gene_set.txt", header=None, sep="\t", dtype=str)[0]
tar_id_type = "symbol"
ref_set = pd.read_csv("input/reference_gene_set.txt", header=None, sep="\t", dtype=str)[0]
ref_id_type = "symbol"
mode = "set-set"

Optional input (Default parameters are set)

In [28]:
# ==== define optional input influencing results ====
distance_measure = "jaccard"  # distance measure to use: "jaccard" or "overlap"
background_model = "term-pres"  # background model used to generate the random sets
runs = 1000  # number of random runs for empirical p-value estimation
perc = 100  # percentage of the original input to be perturbed for the background model
enriched = False

# ==== define optional input influencing saving of results ====
out_dir = "results/gene_set/"
verbose = True  # print additional information during the run
prefix = "gene_set"

Run script

In [29]:
results = single_validation(tar=tar_set, tar_id=tar_id_type, ref=ref_set, ref_id=ref_id_type, mode=mode, 
                            runs=runs, background_model=background_model, verbose=verbose, enriched=enriched, 
                            distance=distance_measure)
[00:00:00|2803.45MB] Check for proper setup ...
[00:00:00|2803.45MB] Starting validation ...
[00:00:00|2803.45MB] Load mappings for input into cache ...
[00:00:02|2800.53MB] Validation of input ...
[00:00:02|2800.53MB] Validation of random runs ...
[00:01:57|2802.31MB] Calculating p-values ...
[00:01:57|2802.31MB] Finished validation

Inspect results

Check all P-values

In [30]:
pd.DataFrame(results["p_values"]['values'])
Out[30]:
JI-based
GO.BP 0.017982
GO.CC 0.712288
GO.MF 0.458541
KEGG 0.067932

Check validation values of the input set

In [31]:
pd.DataFrame(results["input_values"]['values'])
Out[31]:
JI-based
GO.BP 0.675
GO.CC 0.7
GO.MF 0.675
KEGG 0.45

Save results

Save the full results as JSON and the two tables above as CSV

In [32]:
save_results(results=results, prefix=prefix, out_dir=out_dir)

Create visualization

If the plot flag was set, you can inspect a plot of the p-values for every attribute and a mappability plot showing how many of the input ids were assigned values per attribute.

In [33]:
create_plots(results=results, mode=mode, tar=tar_set, tar_id=tar_id_type, out_dir=out_dir, prefix=prefix)
In [34]:
create_extended_plots(results=results, mode=mode, tar=tar_set, out_dir=out_dir, prefix=prefix)

Subnetwork

Set input parameters

Required input

In [35]:
# ==== define required input ====
tar_set = pd.read_csv("input/target_gene_subnetwork.txt", header=None, sep=",", dtype=str)[0]
tar_id_type = "symbol"
network_data = {"network_file":"input/gene_network.graphml", "prop_name":"name", "id_type":"symbol"}
mode = "subnetwork"
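The network_data dictionary points at a GraphML file whose nodes carry the gene symbols in the node attribute named by prop_name. To illustrate what such a file looks like, here is a minimal stdlib sketch that reads that attribute from an inlined GraphML document (illustrative only, with made-up genes; biodigest parses the file internally):

```python
import xml.etree.ElementTree as ET

GRAPHML = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="name" attr.type="string"/>
  <graph id="G" edgedefault="undirected">
    <node id="n0"><data key="d0">BRCA1</data></node>
    <node id="n1"><data key="d0">TP53</data></node>
    <edge source="n0" target="n1"/>
  </graph>
</graphml>"""

def node_names(graphml_text, prop_name="name"):
    """Collect the node attribute `prop_name` from a GraphML document."""
    ns = {"g": "http://graphml.graphdrawing.org/xmlns"}
    root = ET.fromstring(graphml_text)
    # find the <key> ids that store the requested node attribute
    key_ids = {k.get("id") for k in root.findall("g:key", ns)
               if k.get("attr.name") == prop_name and k.get("for") == "node"}
    names = []
    for node in root.findall(".//g:node", ns):
        for data in node.findall("g:data", ns):
            if data.get("key") in key_ids:
                names.append(data.text)
    return names

print(node_names(GRAPHML))  # -> ['BRCA1', 'TP53']
```

The id_type entry tells DIGEST how to interpret those attribute values, here as gene symbols.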

Optional input (Default parameters are set)

In [36]:
# ==== define optional input influencing results ====
distance_measure = "jaccard"  # distance measure to use: "jaccard" or "overlap"
background_model = "network"  # background model used for subnetwork validation
runs = 1000  # number of random runs for empirical p-value estimation
perc = 100  # percentage of the original input to be perturbed for the background model
enriched = True

# ==== define optional input influencing saving of results ====
out_dir = "results/gene_network/"
verbose = True  # print additional information during the run
prefix = "gene_network"

Run script

In [37]:
results = single_validation(tar=tar_set, tar_id=tar_id_type, mode=mode, 
                            runs=runs, background_model=background_model, verbose=verbose, enriched=enriched, 
                            distance=distance_measure, network_data=network_data)
[00:00:00|2803.34MB] Check for proper setup ...
[00:00:00|2803.34MB] Starting validation ...
[00:00:00|2803.34MB] Load mappings for input into cache ...
[00:00:02|2799.47MB] Load distances for input into cache ...
[00:00:10|2833.23MB] Validation of input ...
Missing values for pathway.kegg :8/36
[00:00:11|2833.23MB] Validation of random runs ...
[00:02:03|2837.05MB] Calculating p-values ...
[00:02:03|2837.05MB] Finished validation

Inspect results

Check all P-values

In [38]:
pd.DataFrame(results["p_values"]['values'])
Out[38]:
JI-based
GO.BP 0.069930
GO.CC 0.476523
GO.MF 0.446553
KEGG 0.155844

Check validation values of the input subnetwork

In [39]:
pd.DataFrame(results["input_values"]['values'])
Out[39]:
JI-based
GO.BP 0.026896
GO.CC 0.118837
GO.MF 0.115702
KEGG 0.031394

Save results

Save the full results as JSON and the two tables above as CSV

In [40]:
save_results(results=results, prefix=prefix, out_dir=out_dir)

Create visualization

If the plot flag was set, you can inspect a plot of the p-values for every attribute and a mappability plot showing how many of the input ids were assigned values per attribute.

In [41]:
create_plots(results=results, mode=mode, tar=tar_set, tar_id=tar_id_type, out_dir=out_dir, prefix=prefix)
In [42]:
create_extended_plots(results=results, mode=mode, tar=tar_set, out_dir=out_dir, prefix=prefix)